TAILORED TESTING, AN APPLICATION OF STOCHASTIC APPROXIMATION*

This paper deals with a rather difficult problem in educational (or 
mental) testing. However, the statistical reader need not be familiar with 
the basic ideas of classical mental test theory, in particular the notions
of "true score" and "reliability." The inconsequence of the classical
theory here is surprising. Perhaps this indicates that the approach to be 
used is no less fundamental than the classical theory itself. 

Consider the educator or psychologist whose purpose is to measure
"ability" or achievement (or other trait) for a number of individuals.
Let us denote the trait being measured by $\theta$ and choose a scale of
measurement so that $\theta$ varies from $-\infty$ to $+\infty$. For
present purposes, $\theta$ is not a chance variable; it is simply a
parameter describing a person.

The educator has a large bag full of test questions called "items."
Let us only consider items that are scored "right" or "wrong." Denote
the score on item $i$ by $u_i = 1$ or $0$. Think of testing just one
person; or if a group is to be tested, consider each person individually.
We plan to use the individual's responses to a selected subset of the
items in order to estimate his value of $\theta$.

*This work was supported in part by contract N00014-69-C-0017,
project designation NR 150-505, between the Personnel and Training
Research Programs Office, Psychological Sciences Division, Office of
Naval Research and Educational Testing Service. Reproduction in whole
or in part is permitted for any purpose of the United States Government.

If we are to do this, we need to know something about how his
responses depend on his ability, that is, something about the function

$$P_i = P_i(\theta) = \mathrm{Prob}(u_i = 1 \mid \theta). \tag{1}$$

This function is called the trace line or item characteristic function 
or item characteristic curve [5, chapts. 16-17; 4]. It seems reasonable
to assume that the function is continuous and monotonic increasing: the
higher the ability level, the greater the probability of success.

Typically, the item characteristic function is assumed to be a logis- 
tic function, or a cumulative normal distribution function; or possibly one 
of these functions so modified as to have its lower asymptote greater than 
0. Some typical item characteristic curves are shown in Figure 1. The 
meaning of the descriptive parameters $a$, $b$, and $c$ need not concern
us at this point. 
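
As a rough illustration of the kind of curve just described, the sketch
below assumes a logistic form with an optional lower asymptote. The
function name `item_characteristic` and the particular parameterization
(`a` as a slope, `b` as a location, `c` as the lower asymptote) are
assumptions made for this example, not definitions taken from the text.

```python
import math

def item_characteristic(theta, b=0.0, a=1.0, c=0.0):
    """Probability of a right answer at ability theta.

    Assumed form: a logistic curve in a * (theta - b), raised so that its
    lower asymptote is c instead of 0.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# Compare an unmodified curve (c = 0) with one whose lower asymptote is 0.2.
for theta in (-2.0, 0.0, 2.0):
    plain = item_characteristic(theta)
    raised = item_characteristic(theta, c=0.2)
    print(theta, round(plain, 3), round(raised, 3))
```

Under these assumptions the curve is continuous and increasing in
`theta`, as the text requires.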

Although the unmodified curves are cumulative distribution functions,
it is usually not helpful to think of them in this way. The item
characteristic function is best thought of as the regression of the item
score $u_i$ on $\theta$.

It is common to assume that for any set of items the conditional
probability of success on all the items when $\theta$ is fixed is simply
the product of the separate probabilities of success. This assumption is
known as the assumption of local independence. It implies that items are
uncorrelated when ability is held constant (not that they are uncorrelated
in ordinary groups of examinees).
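
Written out for a set of $n$ items, with $P_i(\theta)$ the item
characteristic functions defined earlier, the assumption reads as follows
(a restatement of the sentence above, not an additional condition):

$$\mathrm{Prob}(u_1 = 1, u_2 = 1, \ldots, u_n = 1 \mid \theta) = \prod_{i=1}^{n} P_i(\theta).$$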

Figure 1. Probability of correct answer as a function of
ability, as estimated for five SAT Verbal items.

To see the reason for the assumption, let us suppose on the contrary
that $P_{ij}$, the probability of simultaneous success on items $i$ and
$j$, is greater than $P_i P_j$. This would mean that even when $\theta$
is fixed, there is some psychological dimension that helps to determine
whether items $i$ and $j$ are answered correctly. In other words, for
fixed $\theta$, items $i$ and $j$ constitute a two-item test measuring
some psychological dimension other than $\theta$. This is just the sort
of situation that the assumption of local independence is intended to
rule out. For simplicity, we wish to consider tests that measure a single
psychological dimension rather than tests that measure several at the
same time.

Let us suppose that the items in the educator's bag have been
extensively pretested, so that the shape of the item characteristic curve
for each item is known to a good approximation. Let us select from the
bag a large set of items whose characteristic curves differ only by a
translation along the $\theta$-axis. Thus, we may describe item $i$ by a
parameter $b_i$, called the difficulty of item $i$, defined by the
equation

$$P_i(b_i) = \alpha ,$$

where $\alpha$ is some constant, possibly 1/2, chosen by the statistician.

Since all characteristic curves differ only by a translation, the curve
for item $i$ will be written as $P(\theta - b_i)$.

Suppose now that we administer many items to a particular student.
By trial and error, we can find approximately the item difficulty level
$b$ at which the student has probability of success $\alpha$. Once we
have done this, we can now estimate the student's ability level as
approximately equal to $b$.

As I have outlined it, the problem of estimating $\theta$ is a standard
problem in stochastic approximation [8]. Specifically, the stochastic
approximation problem is to choose a sequence of items, that is, a
sequence of $b_v$, in such a way that we can estimate $\theta$ from the
resulting sequence of $u_v$.

The item sequence constitutes a tailored test, so called because the 
items are chosen specifically in an attempt to measure one particular
individual as effectively as possible. 

Although tailored testing can be carried out in a paper-and-pencil
situation, it is relatively difficult to do so. On the other hand, if a
computer is available, as it is in many educational institutions, then a
large number of test items can be stored in the computer. Once an effec- 
tive rule for selecting items is provided, the computer can easily produce 
a test specially tailored to the ability level of each individual being 
tested. 

According to the Robbins-Monro stochastic approximation procedure,
the difficulty of the $(v + 1)$-st item is to be determined by the rule

$$b_{v+1} = b_v + d_v (u_v - \alpha), \tag{2}$$

where $d_1, d_2, \ldots$ is a suitable decreasing sequence of positive
constants chosen in advance by the statistician [8, 7, 3]. Typically,

$$d_v = d_1 / v, \tag{3}$$

a harmonic sequence. Each $d_v$ determines a "step size" by which item
difficulty is adjusted: upwards if $u_v = 1$, downwards if $u_v = 0$.

This rule for choosing items has the following effect: Each time the
student answers an item correctly, the next item administered is chosen to
be more difficult. Each time his answer is incorrect, the next item is
chosen to be easier. The increment or decrement in item difficulty is
large at the start of the testing, when little is known about the student's
ability level, and becomes smaller as testing proceeds. All these
properties of the rule seem intuitively desirable.
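
The behavior described above can be tried out in a small simulation. The
sketch below assumes the update $b_{v+1} = b_v + d_v(u_v - \alpha)$ with
the harmonic step sizes $d_v = d_1 / v$, and it assumes a logistic
response function with $P(0) = 0.5$. The names `p_correct` and
`robbins_monro_test` and the default values of `d1` and `b1` are
hypothetical choices for illustration only.

```python
import math
import random

def p_correct(z):
    """Assumed logistic response function P(z), z = theta - b, P(0) = 0.5."""
    return 1.0 / (1.0 + math.exp(-z))

def robbins_monro_test(theta, n, d1=4.0, b1=0.0, alpha=0.5, rng=None):
    """Simulate one examinee; return difficulties b_1..b_{n+1} and scores u_1..u_n."""
    rng = rng or random.Random()
    b, u = [b1], []
    for v in range(1, n + 1):
        # Score the v-th item: right (1) with probability P(theta - b_v).
        right = 1 if rng.random() < p_correct(theta - b[-1]) else 0
        u.append(right)
        # Shrinking step: d_v = d1 / v, applied to (u_v - alpha).
        b.append(b[-1] + (d1 / v) * (right - alpha))
    return b, u

b, u = robbins_monro_test(theta=1.0, n=200, rng=random.Random(0))
print("last difficulty reached:", round(b[-1], 2))
```

With `alpha = 0.5` each right answer moves the difficulty up and each
wrong answer moves it down by the same amount, and that amount shrinks in
proportion to $1/v$.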

Unfortunately, a strict adherence to this rule would require storing
$2^n$ items in the computer before beginning to test, where $n$ is the
total number of items (presumed to be fixed in advance) to be administered
to a single examinee. In most tests composed of dichotomously scored
items, $n > 25$.

In order to avoid preparing and storing $2^n$ items, a method called
the up-and-down method, originally designed for testing explosives, can be
used [8]. In this method, the rule for choosing items is the same as under
the Robbins-Monro procedure except that $d_v$ in (2) is replaced by some
predetermined fixed step size $d$.

When $\alpha = 0.5$, the up-and-down rule becomes

$$b_{v+1} = \begin{cases} b_v + d & \text{if } u_v = 1, \text{ with probability } P(\theta - b_v), \\ b_v - d & \text{if } u_v = 0, \text{ with probability } Q(\theta - b_v), \end{cases} \tag{4}$$

where $Q = 1 - P$. Figure 2 may be helpful in visualizing the sequence of
items administered. It is assumed that the difficulty of the first item
is $b_1 = 0$. Any downward path starting at the apex of the figure
represents a possible sequence of item difficulties for a single
individual.

[Figure 2 diagram: lattice of paths descending from an apex; horizontal
axis "Item Difficulties," marked $-3d$, $-2d$, $-d$, $0$, $d$, $2d$, $3d$.]

Figure 2. Possible sequences of item difficulties.
When the difficulty of the first item is $b_1 = 0$, any path
proceeding downwards from the apex represents a possible
sequence.
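
One such downward path can be generated by simulation. The sketch below
assumes the fixed-step rule with $\alpha = 0.5$ (up by $d$ after a right
answer, down by $d$ after a wrong one) and a logistic response function;
`up_and_down_test` and its default arguments are hypothetical names and
values chosen for illustration.

```python
import math
import random

def p_correct(z):
    """Assumed logistic response function P(z), z = theta - b."""
    return 1.0 / (1.0 + math.exp(-z))

def up_and_down_test(theta, n, d=0.5, b1=0.0, rng=None):
    """Simulate one examinee; return difficulties b_1..b_{n+1} and scores u_1..u_n."""
    rng = rng or random.Random()
    b, u = [b1], []
    for _ in range(n):
        right = 1 if rng.random() < p_correct(theta - b[-1]) else 0
        u.append(right)
        # Fixed step: harder item after a right answer, easier after a wrong one.
        b.append(b[-1] + d if right else b[-1] - d)
    return b, u

b, u = up_and_down_test(theta=0.3, n=10, rng=random.Random(0))
print("path of difficulties:", b)
```

Every difficulty visited is of the form `b1 + k * d` for an integer `k`,
which is why the paths lie on a lattice.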

In principle, at least, we must store one item in the computer for 
each intersection shown in Figure 2. Thus if n items are to be adminis- 
tered to the examinee, the total number of items to be prepared and stored 
before testing should be $n(n + 1)/2$. It is quite practical to carry on
computer-based testing with the up-and-down method, especially if a few
shortcuts are taken to reduce the total number of items required.
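
To make the difference in storage requirements concrete, the following
lines simply evaluate the two counts quoted in the text, $2^n$ for the
shrinking-step rule and $n(n + 1)/2$ for the fixed-step rule. The function
names are hypothetical.

```python
def items_needed_shrinking(n):
    """Item count quoted in the text for the shrinking-step rule: 2**n."""
    return 2 ** n

def items_needed_fixed(n):
    """Item count quoted in the text for the fixed-step rule: n(n + 1)/2."""
    return n * (n + 1) // 2

for n in (25, 60):
    print(n, items_needed_shrinking(n), items_needed_fixed(n))
# For n = 25 the two counts are 33554432 and 325.
```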

It is clear from Figure 2 that the up-and-down method corresponds to
a random walk. The transition probabilities $P(\theta - b_v)$ and
$Q(\theta - b_v)$ are stationary. They depend on $b_v$, but they do not
depend on $v$ when $b_v$ is given.

The up-and-down method has been widely recommended and used in
bioassay applications. Dixon and Mood [2] obtained a large-sample
approximation to the maximum likelihood estimator of $\theta$ for the
up-and-down method, on the assumption that $P(\theta)$ is a normal ogive.
The likelihood function is

$$L(u_1, u_2, \ldots, u_n \mid \theta) = \prod_{v=1}^{n} [P(\theta - b_v)]^{u_v} [Q(\theta - b_v)]^{1 - u_v}, \tag{5}$$

where the value of $b_v$ for $v > 1$ depends on the values of $u_1$,
$u_2$, $\ldots$, $u_{v-1}$, as shown by equation (4).
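
For a short response record, a likelihood of this product form can be
maximized by brute force. The sketch below is not Dixon and Mood's
large-sample approximation; it is a plain grid search, assuming a normal
ogive with unit scale for $P$. The function names, the grid limits, and
the sample data are all hypothetical.

```python
import math

def normal_ogive(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_likelihood(theta, b, u):
    """Log of the product of P(theta - b_v) or Q(theta - b_v) over the items."""
    total = 0.0
    for b_v, u_v in zip(b, u):
        p = normal_ogive(theta - b_v)
        total += math.log(p) if u_v == 1 else math.log(1.0 - p)
    return total

def grid_mle(b, u, lo=-4.0, hi=4.0, steps=801):
    """Return the grid point in [lo, hi] with the largest log likelihood."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, b, u))

# Hypothetical up-and-down record with d = 1: difficulties given and scores obtained.
b = [0.0, 1.0, 0.0, 1.0]
u = [1, 0, 1, 0]
print("grid estimate of theta:", grid_mle(b, u))
```

In this symmetric record (two successes at difficulty 0, two failures at
difficulty 1) the likelihood peaks midway between the two difficulties.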

Consider the following three simple methods for scoring the student's
responses to the items administered:

1. The "final-difficulty score," $b_{n+1}$, the difficulty of the
   $(n + 1)$-th item administered, as defined by equation (2).
   Robbins and Monro showed that for suitable sequences $\{d_v\}$,
   the score $b_{v+1}$ converges in probability to $\theta$ as $v$
   becomes large.

2. The "number-right score," $\sum_{v=1}^{n} u_v$, or the
   "proportion-right score," $\frac{1}{n} \sum_{v=1}^{n} u_v$. The
   former is the score most commonly used in scoring conventional
   mental tests.

3. The "average-difficulty score," $x = \frac{1}{n} \sum_{v=2}^{n+1} b_v$.
   This score is simply the average of the difficulty parameters of
   the items administered to the student, omitting the first (since
   the first item is the same for all examinees) and including
   $b_{n+1}$.

[Before going ahead, the reader may wish to make a guess as to the relative
merits of these three scoring methods for the up-and-down (fixed step size)
procedure.]
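
The three scores are easy to compute from a record of one examinee's
test. In the sketch below the list `b` holds $b_1, \ldots, b_{n+1}$ and
`u` holds $u_1, \ldots, u_n$; the function names and the five-item record
(fixed step $d = 0.5$) are hypothetical.

```python
def final_difficulty(b):
    """Score 1: b_{n+1}, the difficulty the next item would have had."""
    return b[-1]

def number_right(u):
    """Score 2: the sum of the item scores."""
    return sum(u)

def proportion_right(u):
    """Score 2, rescaled: number right divided by n."""
    return sum(u) / len(u)

def average_difficulty(b):
    """Score 3: mean of b_2..b_{n+1}, omitting the common first item."""
    return sum(b[1:]) / (len(b) - 1)

b = [0.0, 0.5, 1.0, 0.5, 1.0, 0.5]   # b_1 .. b_6
u = [1, 1, 0, 1, 0]                  # u_1 .. u_5
print(final_difficulty(b), number_right(u), proportion_right(u), average_difficulty(b))
# prints: 0.5 3 0.6 0.7
```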

When the step size shrinks appropriately as $n$ increases, as in the
Robbins-Monro procedure, $b_{n+1}$ is a good estimator for $\theta$. When
the step size is fixed, as in the up-and-down method, $b_{n+1}$ is no
longer a consistent estimator for $\theta$, nor does its sampling variance
approach zero as $n$ becomes large.

Surprisingly, it turns out that, when step size is fixed, the
number-right score is perfectly correlated with $b_{n+1}$.
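
The perfect correlation can be checked directly from the fixed-step rule.
Let $r = \sum_{v=1}^{n} u_v$ denote the number-right score ($r$ is
notation introduced here for illustration). Each of the $r$ right answers
raises the difficulty by $d$, and each of the $n - r$ wrong answers lowers
it by $d$, so

$$b_{n+1} = b_1 + rd - (n - r)d = b_1 + d(2r - n).$$

For fixed $b_1$, $d$, and $n$, the final difficulty is therefore an
increasing linear function of the number-right score.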
Brownlee, Hodges, and Rosenblatt [1] have shown that the
average-difficulty score is asymptotically equivalent to the maximum
likelihood estimator for $\theta$ found by Dixon and Mood. Although no
optimum small-sample properties have been proven for the
average-difficulty score, it appears to be the score of choice for the
up-and-down method at present. (See Wetherill [9] for empirical studies of
Robbins-Monro, up-and-down, and other procedures.)

The remaining problem for discussion here is the evaluation of
different testing procedures and of the different choices of parameters
such as $d$ and $\alpha$.

It frequently happens that organizations test similar groups of stu- 
dents year after year. In this case, they have available an excellent prior 
distribution for the parameter $\theta$ based on records of past
performance. In such situations, the careful design of a tailored testing
procedure would certainly be based on a Bayesian approach (see Owen [6]).

The Bayesian approach will not be used here. For one thing, it com- 
plicates the mathematics. For present purposes, it seems better to 
present results in a form that can be used by a variety of different readers 
having a variety of prior distributions for $\theta$.

Brownlee, Hodges, and Rosenblatt have a recursive method for evaluating
$\mathcal{E}(x \mid \theta)$, the expected score for any given $\theta$;
also for evaluating $\sigma^2(x \mid \theta)$, its conditional variance.
For example, writing $x_v$ for the average-difficulty score based on $v$
items,

$$\begin{aligned}
\mathcal{E}[(v+1)x_{v+1} - (v+1)\theta \mid \theta, b_1 = b]
 &= P(\theta - b)\, \mathcal{E}[v x_v - v\theta \mid \theta, b_1 = b + d] \\
 &\quad + Q(\theta - b)\, \mathcal{E}[v x_v - v\theta \mid \theta, b_1 = b - d] \\
 &\quad + (d - \theta + b)\, P(\theta - b) - (d + \theta - b)\, Q(\theta - b).
\end{aligned} \tag{6}$$

It is quite practical, even for sizeable $n$, to compute recursively on a
computer the desired expectations and sampling variances, for each of a
variety of values of $\theta$.
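
A computation of this kind might look like the following sketch. It does
not reproduce the recursion of Brownlee, Hodges, and Rosenblatt; instead
it enumerates the random walk forward, carrying the probability of each
pair (current lattice position, running sum of positions), which yields
the exact mean and variance of the average-difficulty score for the
fixed-step rule with $\alpha = 0.5$. The logistic response function and
the names `p_correct` and `score_moments` are assumptions made for this
example.

```python
import math
from collections import defaultdict

def p_correct(z):
    """Assumed logistic response function P(z), z = theta - b."""
    return 1.0 / (1.0 + math.exp(-z))

def score_moments(theta, d, n, b1=0.0):
    """Exact mean and variance of the average-difficulty score x.

    A state (k, s) means the current difficulty is b1 + k * d and s is the
    sum of the lattice positions of items 2, 3, ... reached so far.
    """
    states = {(0, 0): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (k, s), pr in states.items():
            p = p_correct(theta - (b1 + k * d))
            nxt[(k + 1, s + k + 1)] += pr * p          # right answer: step up
            nxt[(k - 1, s + k - 1)] += pr * (1.0 - p)  # wrong answer: step down
        states = nxt
    # x = (1/n) * sum of b_2..b_{n+1} = b1 + d * s / n
    mean = sum(pr * (b1 + d * s / n) for (_k, s), pr in states.items())
    var = sum(pr * (b1 + d * s / n - mean) ** 2 for (_k, s), pr in states.items())
    return mean, var

mean, var = score_moments(theta=1.0, d=0.5, n=20)
print("expected score:", round(mean, 3), "variance:", round(var, 3))
```

The number of states grows only polynomially with `n`, so the same
function can be evaluated over a grid of `theta` values for test lengths
such as 60.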

This brings us to an interesting problem in statistical inference. In
bioassay $\theta$ is typically the dose of a drug at which
$P(\theta) = \alpha$. Usually, any bias in the estimation procedure is a
serious problem. In bioassay, it would seem appropriate to use the
mean-square error

$$\mathrm{MSE} = \mathcal{E}(x - \theta)^2 = \sigma^2(x \mid \theta) + [\mathcal{E}(x \mid \theta) - \theta]^2 \tag{7}$$

as an appropriate measure of the effectiveness of a particular procedure.

In mental testing, on the other hand, biased estimates of $\theta$ are
perfectly satisfactory provided the bias is the same for each student
tested. The fact is that in most situations the origin and the unit of
measurement for measuring ability are arbitrary. Thus the parameter

$$\theta^* = A + B\theta ,$$

where $A$ and $B$ are any constants with $B > 0$, is usually quite as
satisfactory as the parameter $\theta$ itself. For example, the
number-right score $\sum_{v=1}^{n} u_v$ is just as satisfactory an
estimator as the proportion-right score $\frac{1}{n} \sum_{v=1}^{n} u_v$,
even though these scores clearly do not estimate the same parameter.
Both scores are satisfactory despite the fact that the sampling variance
of the first is $n^2$ times as large as the sampling variance of the
second.
If we cannot discriminate among estimation procedures on the basis of
bias and sampling variance, how are we to evaluate different estimation
procedures with respect to each other? If the purpose of the test is to
separate students with higher $\theta$ from those with lower $\theta$, it
is clear that in some sense we would like

$$\mathcal{E}(x \mid \theta_2) - \mathcal{E}(x \mid \theta_1)$$

to be large, where $\theta_2 = \theta_1 + \Delta\theta$ and
$\Delta\theta$ is a small increment in $\theta$, sufficiently small so
that $\sigma(x \mid \theta_2)$ is approximately equal to
$\sigma(x \mid \theta_1)$. For present purposes, the effectiveness of a
psychological testing procedure will be described by $I_x(\theta)$, a
function of $\theta$, where

$$I_x(\theta) = \frac{K[\mathcal{E}(x \mid \theta + \Delta\theta) - \mathcal{E}(x \mid \theta)]^2}{\sigma^2(x \mid \theta)}. \tag{8}$$

The constant of proportionality $K$ and the choice of $\Delta\theta$
affect the size of $I_x(\theta)$ but are irrelevant for comparing
different testing procedures as long as the same values are used for each
procedure.
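
A short check shows why a measure of this ratio form does not distinguish
between a score and a linear rescaling of it. Suppose $x^* = A + Bx$ with
$B > 0$ (notation introduced here for illustration). Then
$\mathcal{E}(x^* \mid \theta) = A + B\,\mathcal{E}(x \mid \theta)$ and
$\sigma^2(x^* \mid \theta) = B^2 \sigma^2(x \mid \theta)$, so

$$I_{x^*}(\theta) = \frac{K B^2 [\mathcal{E}(x \mid \theta + \Delta\theta) - \mathcal{E}(x \mid \theta)]^2}{B^2 \sigma^2(x \mid \theta)} = I_x(\theta).$$

In particular, the number-right and proportion-right scores, which differ
only by the factor $1/n$, receive the same effectiveness.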

The sorts of results obtained are illustrated in Figures 3 and 4. The
ability level of the examinee is shown along the horizontal axis in each
figure. The effectiveness of each test procedure is shown along the
vertical axis (although different numerical scales are used in the two
figures). Each curve shows the effectiveness of a procedure as a function
of ability level.

Figure 3. A comparison of measurement efficiency for the
up-and-down method when there is chance success ($c = .2$) and
when there is not ($c = 0$). $n = 60$.

Figure 4. Efficiency of three 60-item tailored testing
procedures as compared to that of three conventional 60-item
peaked ("standard") tests, as a function of examinee ability.

The curves labeled "standard" are displayed to provide familiar
benchmarks. Each standard test is a conventional (not a tailored) test,
composed entirely of items of equal difficulty, the score being the
number of right answers. For such a test, the number of right answers is
known to be a sufficient statistic for estimating $\theta$. Thus a
horizontal line passing through the maximum of a standard curve represents
an upper limit to the effectiveness of any test procedure based on
dichotomously scored items, an upper limit that ordinarily would be
attainable only if the examinee's true value of $\theta$ were known in
advance of the testing.
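
For a standard test of this kind the effectiveness measure can be written
down in a few lines, since the number-right score on $n$ equally difficult
items is binomial. The sketch below assumes a logistic response function
without guessing and a forward difference of size `delta`; the name
`peaked_test_effectiveness` and all default values are hypothetical.

```python
import math

def p_correct(z):
    """Assumed logistic response function P(z), z = theta - b."""
    return 1.0 / (1.0 + math.exp(-z))

def peaked_test_effectiveness(theta, n=60, b=0.0, delta=0.01, K=1.0, proportion=False):
    """Effectiveness of a conventional test of n items, all of difficulty b.

    Number right is binomial: mean n * P, variance n * P * Q. With
    proportion=True the score is divided by n before the ratio is formed.
    """
    scale = 1.0 / n if proportion else 1.0
    p = p_correct(theta - b)
    mean_here = scale * n * p
    mean_next = scale * n * p_correct(theta + delta - b)
    variance = scale ** 2 * n * p * (1.0 - p)
    return K * (mean_next - mean_here) ** 2 / variance

for theta in (-2.0, 0.0, 2.0):
    print(theta, peaked_test_effectiveness(theta))
```

Under these assumptions the curve is highest near the common item
difficulty and falls off on either side, which is the peaked shape of a
"standard" curve; number-right and proportion-right scoring give the same
value.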

The tests shown in Figures 3 and 4 all require administering $n = 60$
items to each student. The three broken curves at the top of Figure 3
display the effectiveness of three tailored tests with fixed step sizes
$d$ = .05, .20, and 1.0. In the situation illustrated, a step size of .05
is seen to be too small. It seems that a step size around $d = .20$ is
most effective for the circumstances considered. A step size of 1.0 would
be necessary, however, if it were desired to measure accurately at ability
levels above $\theta = 4$ or below $\theta = -4$.

The three curves labeled $c = 0$ describe tailored testing with
items that cannot be answered correctly by guessing. The curves labeled
$c = .2$ describe tailored tests composed of items that will be answered
correctly at least 20 percent of the time, even by examinees at very low
ability levels. The figure shows that when low-ability students can
sometimes answer items correctly (whether by random or nonrandom
guessing), there is a considerable loss of measurement effectiveness. A
part of this loss may be recovered by a more suitable choice of the
parameter $\alpha$; however, most of the loss is irretrievable.
Figure 4 is similar to Figure 3. The vertical axis still measures the
effectiveness of the procedure, although expressed in different units. All
the curves relate to tests composed of items that may be successfully
answered by random guessing. The curve labeled "shrinking step size"
represents the "best" Robbins-Monro procedure found for a certain purpose
after investigating a large number of such procedures. The "fixed step
size" curve represents similarly the "best" of a large number of
up-and-down procedures. The "two-block test" represents a "best" two-stage
procedure. The first stage is the administration of a single conventional
test with all items at the same difficulty level. The number-right score
on this "routing test" is used to assign the examinee to take a single
second-stage test, which again is a conventional test consisting of items
all at the same difficulty level.

The results displayed in the figure show, among other things, that the 
up-and-down procedure with fixed step size is in this case almost as ef- 
ficient as the Robbins-Monro procedure. (Note that figures are displayed 
here to illustrate types of conclusions obtainable, not to establish 
conclusions for themselves.) 